MIXING LEAST-SQUARES ESTIMATORS WHEN THE VARIANCE IS UNKNOWN

CHRISTOPHE GIRAUD

Abstract. We propose a procedure to handle the problem of Gaussian regression when the variance is unknown. We mix least-squares estimators from various models according to a procedure inspired by that of Leung and Barron [17]. We show that in some cases the resulting estimator is a simple shrinkage estimator. We then apply this procedure in various statistical settings such as linear regression or adaptive estimation in Besov spaces. Our results provide non-asymptotic risk bounds for the Euclidean risk of the estimator.

1. Introduction

We consider the regression framework, where we have noisy observations

(1) $Y_i = \mu_i + \sigma\varepsilon_i, \quad i = 1,\ldots,n$

of an unknown vector $\mu = (\mu_1,\ldots,\mu_n)' \in \mathbb{R}^n$. We assume that the $\varepsilon_i$'s are i.i.d. standard Gaussian random variables and that the noise level $\sigma > 0$ is unknown. Our aim is to estimate $\mu$.

In this direction, we introduce a finite collection $\{S_m, m \in \mathcal{M}\}$ of linear spaces of $\mathbb{R}^n$, which we call henceforth models. To each model $S_m$, we associate the least-squares estimator $\hat\mu_m = \Pi_{S_m} Y$ of $\mu$ on $S_m$, where $\Pi_{S_m}$ denotes the orthogonal projector onto $S_m$. The $L^2$-risk of the estimator $\hat\mu_m$ with respect to the Euclidean norm $\|\cdot\|$ on $\mathbb{R}^n$ is

(2) $\mathbb{E}\left[\|\mu - \hat\mu_m\|^2\right] = \|\mu - \Pi_{S_m}\mu\|^2 + \dim(S_m)\,\sigma^2.$



Two strategies have emerged to handle the problem of the choice of an estimator of $\mu$ in this setting. One strategy is to select a model $S_{\hat m}$ with a data-driven criterion and use $\hat\mu_{\hat m}$ to estimate $\mu$. In the favorable cases, the risk of this estimator is of the order of the minimum over $\mathcal{M}$ of the risks (2). Model selection procedures have received a lot of attention in the literature, starting from the pioneering work of Akaike [1] and Mallows [18]. A historical review of the topic is beyond the scope of this paper. We simply mention, in the Gaussian setting, the papers of Birgé and Massart [7, 8] (influenced by Barron and Cover [5] and Barron, Birgé and Massart [4]), which give non-asymptotic risk bounds for a selection criterion generalizing Mallows' $C_p$.

An alternative to model selection is mixing. One estimates $\mu$ by a convex (or linear) combination of the $\hat\mu_m$'s

(3) $\hat\mu = \sum_{m \in \mathcal{M}} w_m \hat\mu_m,$

with weights $w_m$ which are $\sigma(Y)$-measurable random variables. This strategy is not suitable when the goal is to select a single model $S_{\hat m}$; nevertheless, it enjoys the nice property that $\hat\mu$ may perform better than the best of the $\hat\mu_m$'s. Various choices of weights $w_m$ have been proposed, from an information-theoretic or Bayesian perspective. Risk bounds have been provided by Catoni [11], Yang [25, 28], Tsybakov [23] and Bunea et al. [9] for regression on a random design and by Barron [3], Catoni [10] and Yang [26] for density estimation. For the Gaussian regression framework we consider here, Leung and Barron [17] propose a mixing procedure for which they derive a precise non-asymptotic risk bound. When the collection of models is not too complex, this bound shows that the risk of their estimator $\hat\mu$ is close to the minimum over $\mathcal{M}$ of the risks (2). Another nice feature of their mixing procedure is that both the weights $w_m$ and the estimators $\hat\mu_m$ are built on the same data set, which makes it possible to handle cases where the law of the data set is not exchangeable. Unfortunately, their choice of weights $w_m$ depends on the variance $\sigma^2$, which is usually unknown.

In the present paper, we consider the more practical situation where the variance is unknown. Our mixing strategy is akin to that of Leung and Barron [17], but does not depend on the variance $\sigma^2$. In addition, we show that both our estimator and the estimator of Leung and Barron are simple shrinkage estimators in some cases. From a theoretical point of view, we relate our weights $w_m$ to a Gibbs measure on $\mathcal{M}$ and derive a sharp risk bound for the estimator $\hat\mu$. Roughly, this bound says that the risk of $\hat\mu$ is close to the minimum over $\mathcal{M}$ of the risks (2) in the favorable cases. We then discuss the choice of the collection of models $\{S_m, m \in \mathcal{M}\}$ in various situations. Among others, we produce an estimation procedure which is adaptive over a large class of Besov balls.

Before presenting our mixing procedure, we briefly recall that of Leung and Barron [17]. Assuming that the variance $\sigma^2$ is known, they use the weights

(4) $w_m = \dfrac{\pi_m}{Z}\exp\left(-\beta\left[\|Y - \hat\mu_m\|^2/\sigma^2 + 2\dim(S_m) - n\right]\right), \quad m \in \mathcal{M},$

where $\{\pi_m, m \in \mathcal{M}\}$ is a given prior distribution on $\mathcal{M}$ and $Z$ normalizes the sum of the $w_m$'s to one. These weights have a Bayesian flavor. Indeed, they appear with $\beta = 1/2$ in Hartigan [15], which considers the Bayes procedure with the following (improper) prior distribution: pick an $m$ in $\mathcal{M}$ according to $\pi_m$ and then sample "uniformly" on $S_m$. Nevertheless, in Leung and Barron [17] the role of the prior distribution $\{\pi_m, m \in \mathcal{M}\}$ is to favor models with low complexity. Therefore, the choice of $\pi_m$ is driven by the complexity of the model $S_m$ rather than by prior knowledge on $\mu$. In this sense their approach differs from the classical Bayesian point of view. Note that the term $\|Y - \hat\mu_m\|^2/\sigma^2 + 2\dim(S_m) - n$ appearing in the weights (4) is an unbiased estimator of the risk (2) rescaled by $\sigma^2$. The size of the weight $w_m$ then depends on the difference between this estimator of the risk (2) and $-\log(\pi_m)$, which can be thought of as a complexity-driven penalty (in the spirit of Barron and Cover [5] or Barron et al. [4]). The parameter $\beta$ tunes the balance between these two terms. For $\beta \le 1/4$, Theorem 5 in [17] provides a sharp risk bound for the procedure.

The rest of the paper is organized as follows. We present our mixing strategy in the next section and express in some cases the resulting estimator $\hat\mu$ as a shrinkage estimator. In Section 3, we state non-asymptotic risk bounds for the procedure and discuss the choice of the tuning parameters. Finally, we propose in Section 4 some weighting strategies for linear regression or for adaptive regression over Besov balls. Section 5 is devoted to a numerical illustration and Section 6 to the proofs. Additional results are given in the Appendix.

We end this section with some notation we shall use throughout this paper. We write $|m|$ for the cardinality of a finite set $m$, and $\langle x, y\rangle$ for the inner product of two vectors $x$ and $y$ in $\mathbb{R}^n$. For any real number $x$, we denote by $(x)_+$ its positive part and by $\lfloor x \rfloor$ its integer part.



2. The estimation procedure 



We assume henceforth that $n \ge 3$.

2.1. The estimator. We start with a finite collection of models $\{S_m, m \in \mathcal{M}\}$ and to each model $S_m$ we associate the least-squares estimator $\hat\mu_m = \Pi_{S_m} Y$ of $\mu$ on $S_m$. We also introduce a probability distribution $\{\pi_m, m \in \mathcal{M}\}$ on $\mathcal{M}$, which is meant to take into account the complexity of the family and favor models with low dimension. For example, if the collection $\{S_m, m \in \mathcal{M}\}$ has (at most) $e^{ad}$ models per dimension $d$, we suggest choosing $\pi_m \propto e^{-(a+1/2)\dim(S_m)}$, see the example at the end of Section 3.1. As mentioned before, the quantity $-\log(\pi_m)$ can be interpreted as a complexity-driven penalty associated to the model $S_m$ (in the sense of Barron et al. [4]). The performance of our estimation procedure depends strongly on the choice of the collection of models $\{S_m, m \in \mathcal{M}\}$ and the probability distribution $\{\pi_m, m \in \mathcal{M}\}$. We detail in Section 4 some suitable choices of these families for linear regression and estimation of BV or Besov functions.



Hereafter, we assume that there exists a linear space $S_* \subset \mathbb{R}^n$ of dimension $d_* < n$, such that $S_m \subset S_*$ for all $m \in \mathcal{M}$. We will take advantage of this situation and estimate the variance of the noise by

(5) $\hat\sigma^2 = \dfrac{\|Y - \Pi_{S_*} Y\|^2}{N_*},$

where $N_* = n - d_*$. We emphasize that we do not assume that $\mu \in S_*$, and the estimator $\hat\sigma^2$ is (positively) biased in general. It turns out that our estimation procedure does not need a precise estimation of the variance $\sigma^2$ and the choice (5) gives good results. In practice, we may replace the residual estimator by a difference-based estimator (Rice [20], Hall et al. [14], Munk et al. [19], Tong and Wang [22], Wang et al. [29], etc.) or by any non-parametric estimator (e.g. Lenth [16]), but we are not able to prove any bound similar to (12) or (13) when using one of these estimators.



Finally, we associate to the collection of models $\{S_m, m \in \mathcal{M}\}$ a collection $\{L_m, m \in \mathcal{M}\}$ of non-negative weights. We recommend setting $L_m = \dim(S_m)/2$, but any (sharp) upper bound of this quantity may also be appropriate, see the discussion after Theorem 1. Then, for a given positive constant $\beta$, we define the estimator $\hat\mu$ by

(6) $\hat\mu = \sum_{m \in \mathcal{M}} w_m \hat\mu_m \quad\text{with}\quad w_m = \dfrac{\pi_m}{Z}\exp\left(\beta\,\dfrac{\|\hat\mu_m\|^2}{\hat\sigma^2} - L_m\right),$

where $Z$ is a constant that normalizes the sum of the $w_m$'s to one. An alternative formula for $w_m$ is $w_m = \pi_m\exp\left(-\beta\|\Pi_{S_*}Y - \hat\mu_m\|^2/\hat\sigma^2 - L_m\right)/Z'$ with $Z' = e^{-\beta\|\Pi_{S_*}Y\|^2/\hat\sigma^2} Z$. We can interpret the term $\|\Pi_{S_*}Y - \hat\mu_m\|^2/\hat\sigma^2 + L_m/\beta$ appearing in the exponential as a (biased) estimate of the risk (2) rescaled by $\sigma^2$. As in (4), the balance in the weight $w_m$ between this estimate of the risk and the penalty $-\log(\pi_m)$ is tuned by $\beta$. We refer to the discussion after Theorem 1 for the choice of this parameter. We mention that the weights $\{w_m, m \in \mathcal{M}\}$ can be viewed as a Gibbs measure on $\mathcal{M}$, and we will use this property to assess the performance of the procedure.
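As an illustration of the weights in (6), here is a minimal sketch in Python of how they might be computed once the squared norms $\|\hat\mu_m\|^2$, the log-priors $\log \pi_m$, the penalties $L_m$ and the variance estimate $\hat\sigma^2$ are available. The function name `gibbs_weights`, its arguments and the toy numbers are assumptions made for this example, not part of the paper; subtracting the largest score before exponentiating is a standard numerical safeguard and does not change the normalized weights.

```python
import math

def gibbs_weights(sq_norms, log_priors, penalties, beta, sigma2_hat):
    """Return weights proportional to pi_m * exp(beta * ||mu_hat_m||^2 / sigma2_hat - L_m).

    All argument names are hypothetical; the lists are indexed by the models m.
    """
    scores = [lp + beta * s / sigma2_hat - pen
              for s, lp, pen in zip(sq_norms, log_priors, penalties)]
    top = max(scores)                      # shift for numerical stability
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)              # plays the role of the constant Z
    return [u / total for u in unnormalized]

# Toy example with three nested models of dimension 0, 1 and 2 (made-up numbers).
dims = [0, 1, 2]
sq_norms = [0.0, 9.0, 10.0]                # ||mu_hat_m||^2 for each model
log_priors = [math.log(1.0 / 3.0)] * 3     # uniform prior, for illustration only
penalties = [d / 2.0 for d in dims]        # L_m = dim(S_m) / 2
weights = gibbs_weights(sq_norms, log_priors, penalties, beta=0.2, sigma2_hat=1.5)
```

Adding the same constant to every squared norm leaves the weights unchanged, which mirrors the alternative formula with $Z'$ above.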

We show in the next section that $\hat\mu$ is a simple shrinkage estimator in some cases.



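2.2. A simple shrinkage estimator. In this section, we focus on the case where $\mathcal{M}$ consists of all the subsets of $\{1,\ldots,p\}$ for some $p < n$, and $S_m = \mathrm{span}\{v_j, j \in m\}$ with $\{v_1,\ldots,v_p\}$ an orthonormal family of vectors in $\mathbb{R}^n$. We use the convention $S_\emptyset = \{0\}$. An example of such a setting is given in Section 4.2, see also the numerical illustration in Section 5. Note that $S_*$ corresponds here to $S_{\{1,\ldots,p\}}$ and $d_* = p$.

To favor models with small dimensions, we choose the probability distribution

(7) $\pi_m = \left(1 + p^{-\alpha}\right)^{-p} p^{-\alpha|m|}, \quad m \in \mathcal{M},$

with $\alpha > 0$. We also set $L_m = b|m|$ for some $b \ge 0$.

Proposition 1. Under the above assumptions, we have the following expression for $\hat\mu$:

(8) $\hat\mu = \sum_{j=1}^{p} (c_j Z_j)\, v_j, \quad\text{with}\quad Z_j = \langle Y, v_j\rangle \quad\text{and}\quad c_j = \dfrac{\exp\left(\beta Z_j^2/\hat\sigma^2\right)}{p^{\alpha}\exp(b) + \exp\left(\beta Z_j^2/\hat\sigma^2\right)}.$

The proof of this proposition is postponed to Section 6.1. The main interest of Formula (8) is to allow a fast computation of $\hat\mu$. Indeed, we only need to compute the $p$ coefficients $c_j$ instead of the $2^p$ weights $w_m$ of formula (6).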
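The computational gain can be checked on a small example. The sketch below (Python, standard library only) compares the closed-form coefficients $c_j = \exp(\beta Z_j^2/\hat\sigma^2)/(p^\alpha e^b + \exp(\beta Z_j^2/\hat\sigma^2))$ with a brute-force sum over all $2^p$ subsets, using weights proportional to $\exp\big(\sum_{k \in m}(\beta Z_k^2/\hat\sigma^2 - \alpha\log p - b)\big)$. The function names and the numerical values of $Z$, $\hat\sigma^2$, $\alpha$, $b$ and $\beta$ are arbitrary choices made for this illustration.

```python
import itertools
import math

def shrinkage_coefficients(Z, sigma2_hat, alpha, b, beta):
    """Closed-form coefficients: c_j = 1 / (1 + exp(r - beta * Z_j^2 / sigma2_hat))."""
    r = alpha * math.log(len(Z)) + b
    return [1.0 / (1.0 + math.exp(r - beta * z * z / sigma2_hat)) for z in Z]

def brute_force_coefficients(Z, sigma2_hat, alpha, b, beta):
    """Sum the Gibbs weights over all subsets m containing j (exponential cost)."""
    p = len(Z)
    r = alpha * math.log(p) + b
    gains = [beta * z * z / sigma2_hat - r for z in Z]
    total = 0.0
    included = [0.0] * p
    for size in range(p + 1):
        for m in itertools.combinations(range(p), size):
            w = math.exp(sum(gains[k] for k in m))   # unnormalized weight of model m
            total += w
            for j in m:
                included[j] += w
    return [x / total for x in included]

Z = [3.0, -0.2, 1.1, 0.4]                  # hypothetical coefficients <Y, v_j>
params = dict(sigma2_hat=1.0, alpha=1.0, b=1.0, beta=0.25)
fast = shrinkage_coefficients(Z, **params)
slow = brute_force_coefficients(Z, **params)
```

The two lists agree up to rounding error, while the closed form needs only $p$ evaluations instead of $2^p$.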




The coefficients $c_j$ are shrinkage coefficients taking values in $[0, 1]$. They are close to one when $Z_j$ is large and close to zero when $Z_j$ is small. The transition from 0 to 1 occurs when $Z_j^2 \approx \beta^{-1}(b + \alpha\log p)\,\hat\sigma^2$. The choice of the tuning parameters $\alpha$, $\beta$ and $b$ will be discussed in Section 3.

Remark 1. Other choices are possible for $\{\pi_m, m \in \mathcal{M}\}$ and they lead to different $c_j$'s. Let us mention the choice $\pi_m = \left((p+1)\binom{p}{|m|}\right)^{-1}$, for which the $c_j$'s are given by

$$c_j = \frac{\int_0^1 q \prod_{k \ne j}\left[q + (1-q)\exp\left(-\beta Z_k^2/\hat\sigma^2 + b\right)\right] dq}{\int_0^1 \prod_{k=1}^{p}\left[q + (1-q)\exp\left(-\beta Z_k^2/\hat\sigma^2 + b\right)\right] dq}, \quad\text{for } j = 1,\ldots,p.$$

This formula can be derived from the Appendix of Leung and Barron [17].

Remark 2. When the variance is known, we can give a formula similar to (8) for the estimator of Leung and Barron [17]. Let us consider the same setting, with $p \le n$. Then, when the distribution $\{\pi_m, m \in \mathcal{M}\}$ is given by (7), the estimator (3) with weights $w_m$ given by (4) takes the form

(9) $\hat\mu = \sum_{j=1}^{p}\left(\dfrac{e^{\beta Z_j^2/\sigma^2}}{p^{\alpha} e^{2\beta} + e^{\beta Z_j^2/\sigma^2}}\, Z_j\right) v_j.$



3. The performance 



3.1. A general risk bound. The next result gives an upper bound on the $L^2$-risk of the estimation procedure. We remind the reader that $n \ge 3$ and set

(10) $\phi : \ ]0,1[ \ \to \ ]0,+\infty[, \quad x \mapsto (x - 1 - \log x)/2,$

which is decreasing.

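Theorem 1. Assume that $\beta$ and $N_*$ fulfill the condition

(11) $\beta < 1/4 \quad\text{and}\quad N_* \ge 2 + \dfrac{\log n}{\phi(4\beta)},$

with $\phi$ defined by (10). Assume also that $L_m \ge \dim(S_m)/2$ for all $m \in \mathcal{M}$. Then, we have the following upper bounds on the $L^2$-risk of the estimator $\hat\mu$:

(12) $\mathbb{E}\left(\|\mu - \hat\mu\|^2\right) \le -(1+\varepsilon_n)\,\dfrac{\bar\sigma^2}{\beta}\,\log\left[\sum_{m \in \mathcal{M}} \pi_m\, e^{-\beta\left[\|\mu - \Pi_{S_m}\mu\|^2 - \dim(S_m)\sigma^2\right]/\bar\sigma^2 - L_m}\right] + \dfrac{\sigma^2}{2\log n}$

(13) $\phantom{\mathbb{E}\left(\|\mu - \hat\mu\|^2\right)} \le (1+\varepsilon_n)\,\inf_{m \in \mathcal{M}}\left\{\|\mu - \Pi_{S_m}\mu\|^2 + \dfrac{\bar\sigma^2}{\beta}\left(L_m - \log\pi_m\right)\right\} + \dfrac{\sigma^2}{2\log n},$

where $\varepsilon_n = (2n\log n)^{-1}$ and $\bar\sigma^2 = \sigma^2 + \|\mu - \Pi_{S_*}\mu\|^2/N_*$.

The proof of Theorem 1 is delayed to Section 6.3. Let us comment on this result.

To start with, the Bound (12) may look somewhat cumbersome, but it improves (13) when there are several good models to estimate $\mu$. For example, we can derive from (12) the bound

$$\mathbb{E}\left(\|\mu - \hat\mu\|^2\right) \le (1+\varepsilon_n)\inf_{m \in \mathcal{M}}\left\{\|\mu - \Pi_{S_m}\mu\|^2 + \frac{\bar\sigma^2}{\beta}\left(L_m - \log\pi_m\right)\right\} + \inf_{\delta \ge 0}\left\{\delta - \frac{\bar\sigma^2}{\beta}\log|\mathcal{M}_\delta|\right\} + \frac{\sigma^2}{2\log n},$$

where $\mathcal{M}_\delta$ is the set made of those $m^*$ in $\mathcal{M}$ fulfilling

$$\|\mu - \Pi_{S_{m^*}}\mu\|^2 + \frac{\bar\sigma^2}{\beta}\left(L_{m^*} - \log\pi_{m^*}\right) \le \delta + \inf_{m \in \mathcal{M}}\left\{\|\mu - \Pi_{S_m}\mu\|^2 + \frac{\bar\sigma^2}{\beta}\left(L_m - \log\pi_m\right)\right\}.$$

In the extreme case where all the quantities $\|\mu - \Pi_{S_m}\mu\|^2 + \frac{\bar\sigma^2}{\beta}(L_m - \log\pi_m)$ are equal, (12) then improves (13) by a term $\beta^{-1}\bar\sigma^2\log|\mathcal{M}|$.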

We now discuss the choice of the parameter $\beta$ and the weights $\{L_m, m \in \mathcal{M}\}$. The choice $L_m = \dim(S_m)/2$ seems to be the most accurate, since it satisfies the conditions of Theorem 1 and minimizes the right-hand side of (12) and (13). We shall mostly use this one in the following, but there are some cases where it is easier to use some (sharp) upper bound of the dimension of $S_m$ instead of $\dim(S_m)$ itself, see for example Section 4.3.

The largest parameter $\beta$ fulfilling Condition (11) is

(14) $\beta = \dfrac{1}{4}\,\phi^{-1}\left(\dfrac{\log n}{N_* - 2}\right).$

We suggest using this value, since it minimizes the right-hand side of (12) and (13). Nevertheless, as discussed in Section 3.2 for the situation of Section 2.2, it is sometimes possible to use larger values for $\beta$.
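Since $\phi$ has no closed-form inverse, the value (14) has to be computed numerically. The following Python sketch inverts $\phi$ by bisection on $]0,1[$, where $\phi$ is decreasing, and returns $\beta = \phi^{-1}(\log n/(N_*-2))/4$. The function names, the tolerance and the example values $n = 60$, $N_* = 19$ are assumptions made for this illustration.

```python
import math

def phi(x):
    """phi(x) = (x - 1 - log x) / 2, decreasing from +infinity to 0 on ]0, 1[."""
    return (x - 1.0 - math.log(x)) / 2.0

def phi_inverse(y, tol=1e-14):
    """Solve phi(x) = y for x in ]0, 1[ by bisection (y must be positive)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if phi(mid) > y:      # phi is decreasing: the solution lies to the right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def largest_beta(n, n_star):
    """Largest beta allowed by Condition (11): beta = phi^{-1}(log n / (N_* - 2)) / 4."""
    return 0.25 * phi_inverse(math.log(n) / (n_star - 2))

# Example values (assumed for illustration): n = 60 observations, N_* = n - d_* = 19.
beta = largest_beta(60, 19)
```

By construction the returned value is below 1/4 and makes the second part of Condition (11) hold with (near) equality.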

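Finally, we would like to compare the bounds of Theorem 1 with the minimum over $\mathcal{M}$ of the risks given by (2). Roughly, the Bound (13) states that the estimator $\hat\mu$ achieves the best trade-off between the bias $\|\mu - \Pi_{S_m}\mu\|^2/\sigma^2$ and the complexity term $C_m = L_m - \log\pi_m$. More precisely, we derive from (13) the (cruder) bound

(15) $\mathbb{E}\left(\|\mu - \hat\mu\|^2\right) \le (1+\varepsilon_n)\,\inf_{m \in \mathcal{M}}\left\{\|\mu - \Pi_{S_m}\mu\|^2 + \dfrac{1}{\beta}\,C_m\sigma^2\right\} + R_n^*\sigma^2,$

with

$$\varepsilon_n = \frac{1}{2n\log n} \quad\text{and}\quad R_n^* = \frac{1}{2\log n} + \frac{\|\mu - \Pi_{S_*}\mu\|^2}{\beta N_*\sigma^2}\,\sup_{m \in \mathcal{M}} C_m.$$

In particular, if $C_m$ is of order $\dim(S_m)$, then (15) allows one to compare the risk of $\hat\mu$ with the infimum of the risks (2). We discuss this point in the following example.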

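Example. Assume that the family $\mathcal{M}$ has an index of complexity $(M, a)$, as defined in [2], which means that

$$\left|\{m \in \mathcal{M},\ \dim(S_m) = d\}\right| \le M e^{ad}, \quad\text{for all } d \ge 1.$$

If we choose, for example,

(16) $\pi_m = \dfrac{e^{-(a+1/2)\dim(S_m)}}{\sum_{m' \in \mathcal{M}} e^{-(a+1/2)\dim(S_{m'})}} \quad\text{and}\quad L_m = \dim(S_m)/2,$

we have $C_m \le (a+1)\dim(S_m) + \log(3M)$. Therefore, when $\beta$ is given by (14) and $d_* \le \kappa n$ for some $\kappa < 1$, we have

$$\mathbb{E}\left(\|\mu - \hat\mu\|^2\right) \le (1+\varepsilon_n)\,\inf_{m \in \mathcal{M}}\left\{\|\mu - \Pi_{S_m}\mu\|^2 + \frac{a+1}{\beta}\dim(S_m)\sigma^2\right\} + R_n'\sigma^2,$$

with

$$R_n' = \frac{\log(3M)}{\beta} + \frac{1}{2\log n} + \frac{\|\mu - \Pi_{S_*}\mu\|^2}{\sigma^2}\times\frac{(a+1)\kappa + n^{-1}\log(3M)}{\beta(1-\kappa)}.$$

In particular, for a given index of complexity $(M, a)$ and a given $\kappa$, the previous bound gives an oracle inequality.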



3.2. On the choice of the parameter $\beta$. The choice of the tuning parameter $\beta$ is important in practice. Theorem 1 or Theorem 5 in [17] justify the choice of a $\beta$ smaller than 1/4. Nevertheless, Bayesian arguments [15] suggest taking a larger value for $\beta$, namely $\beta = 1/2$. In this section, we discuss this issue on the example of Section 2.2.



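For the sake of simplicity, we will restrict to the case where the variance is known. We consider the weights (4) proposed by Leung and Barron, with the probability distribution $\pi$ given by (7) with $\alpha = 1$, namely

$$\pi_m = \left(1 + p^{-1}\right)^{-p} p^{-|m|}.$$

According to (9), the estimator $\hat\mu$ takes the form

(17) $\hat\mu = \sum_{j=1}^{p} s_\beta(Z_j/\sigma)\, Z_j\, v_j \quad\text{with}\quad Z_j = \langle Y, v_j\rangle \quad\text{and}\quad s_\beta(z) = \dfrac{e^{\beta z^2}}{p\,e^{2\beta} + e^{\beta z^2}}.$

To start with, we note that a choice $\beta > 1/2$ is not to be recommended. Indeed, we can compare the shrinkage coefficient $s_\beta(Z_j/\sigma)$ to a threshold at level $T = (2 + \beta^{-1}\log p)\,\sigma^2$, since

$$s_\beta(Z_j/\sigma) \ge \frac{1}{2}\,\mathbf{1}_{\{Z_j^2 \ge T\}}.$$

For $\mu = 0$, the risk of $\hat\mu$ is then larger than a quarter of the risk of the threshold estimator $\hat\mu_T = \sum_{j=1}^{p}\mathbf{1}_{\{Z_j^2 \ge T\}} Z_j v_j$, namely

$$\mathbb{E}\left(\|0 - \hat\mu\|^2\right) = \sum_{j=1}^{p}\mathbb{E}\left(s_\beta(Z_j/\sigma)^2 Z_j^2\right) \ge \frac{1}{4}\sum_{j=1}^{p}\mathbb{E}\left(\mathbf{1}_{\{Z_j^2 \ge T\}} Z_j^2\right) = \frac{1}{4}\,\mathbb{E}\left(\|0 - \hat\mu_T\|^2\right).$$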
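The comparison with the threshold rule is easy to verify numerically. In the sketch below (Python; the function names and the values $p = 50$, $\beta = 1/2$ are illustrative assumptions), $s_\beta(z)$ is written as $1/(1 + p\,e^{2\beta - \beta z^2})$, which is algebraically the same as the expression in (17), and the level $T/\sigma^2 = 2 + \beta^{-1}\log p$ is the point where $s_\beta$ crosses 1/2.

```python
import math

def s_beta(z, p, beta):
    """Shrinkage coefficient of (17): exp(beta z^2) / (p exp(2 beta) + exp(beta z^2))."""
    return 1.0 / (1.0 + p * math.exp(2.0 * beta - beta * z * z))

def threshold_level(p, beta):
    """Rescaled threshold T / sigma^2 = 2 + log(p) / beta."""
    return 2.0 + math.log(p) / beta

p, beta = 50, 0.5                          # example values, chosen arbitrarily
T = threshold_level(p, beta)
at_threshold = s_beta(math.sqrt(T), p, beta)

# Grid check of s_beta(z) >= (1/2) * 1{z^2 >= T}; the tiny offset keeps the
# grid away from the boundary z^2 = T, where rounding could flip the indicator.
grid = [0.01 * k + 1e-7 for k in range(0, 1000)]
holds = all(s_beta(z, p, beta) >= 0.5 * (1.0 if z * z >= T else 0.0) for z in grid)
```

A larger $\beta$ lowers $T$, which is the mechanism behind the warning against $\beta > 1/2$ in this setting.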



Footnote on the choice $\alpha = 1$: Note that this choice of $\alpha$ minimizes the rate of growth of $-\log\pi_m$ when $p$ goes to infinity, since
$$-\log\pi_m = p\log\left(1 + p^{-\alpha}\right) + \alpha|m|\log p = p^{1-\alpha} + \alpha|m|\log p + o\left(p^{1-\alpha}\right),$$
when $\pi_m$ is given by (7) with $\alpha > 0$.

Figure 1. Plot of $c_{1/2}(p)$.

Now, when the threshold $T$ is of order $2K\log p$ with $K < 1$, the threshold estimator is known to behave poorly for $\mu = 0$, see [7] Section 7.2. Therefore, a choice $\beta > 1/2$ would give poor results at least when $\mu = 0$.



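On the other hand, the next proposition justifies the use of any $\beta \le 1/2$ by a risk bound similar to (13). For $p \ge 1$ and $\beta > 0$, we introduce the numerical constants $\gamma_\beta(p) = \sqrt{2 + \beta^{-1}\log p}$ and

$$c_\beta(p) = \sup_{x \in [0,\,4\gamma_\beta(p)]}\ \frac{\int_{\mathbb{R}}\left(x - (x+z)\,s_\beta(x+z)\right)^2 e^{-z^2/2}\,dz/\sqrt{2\pi}}{\min\left(x^2, \gamma_\beta(p)^2\right) + \gamma_\beta(p)^2/p}\ \vee\ 0.6.$$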

This constant c/3{p) can be numerically computed. For example, Ci/2{p) ^ 1 for any 3 < p < 10^, 
see Figure [TJ 



Proposition 2. For $3 \le p \le n$ and $\beta \in [1/4, 1/2]$, the Euclidean risk of the estimator (17) is upper bounded by

(18) $\mathbb{E}\left(\|\mu - \hat\mu\|^2\right) \le \|\mu - \Pi_{S_*}\mu\|^2 + c_\beta(p)\,\inf_{m \in \mathcal{M}}\left[\|\Pi_{S_*}\mu - \Pi_{S_m}\mu\|^2 + \left(2 + \beta^{-1}\log p\right)(|m| + 1)\,\sigma^2\right].$

The constant $c_\beta(p)$ is (crudely) bounded by 16 when $p \ge 3$ and $\beta \in [1/4, 1/2]$.

We delay the proof of Proposition 2 to Section 6.4. We also emphasize that the bound $c_\beta(p) \le 16$ is crude.

In light of the above result, $\beta = 1/2$ seems to be a good choice in this case and corresponds to the choice of Hartigan [15]. Note that the choice $\beta = 1/2$ has no reason to be a good choice in other situations. Indeed, a different choice of $\alpha$ in (7) would give different "good" values for $\beta$. For $\alpha \ge 1$, one may check that the "good" values for $\beta$ are $\beta \le \alpha/2$.

Remark 3. Proposition 2 does not provide an oracle inequality. The Bound (18) differs from the best trade-off between the bias and the variance term by a $\log p$ factor. This is unavoidable from a minimax point of view, as noticed in Donoho and Johnstone [13].

Remark 4. A similar analysis can be done for the estimator (8) when the variance is unknown. When the parameters $\alpha$ and $b$ in (8) equal 1, we can justify the use of values of $\beta$ fulfilling

$$\beta \le \frac{1}{2}\,\phi^{-1}\left(\frac{\log p}{n - p}\right), \quad\text{for } n > p \ge 3,$$

see the Appendix.



4. Choice of the models and the weights in different settings 



In this section, we propose some choices of weights in three situations: the linear regression, the 
estimation of functions with bounded variation and regression in Besov spaces. 



4.1. Linear regression. We consider the case where the signal $\mu$ depends linearly on some observed explanatory variables $x^{(1)},\ldots,x^{(p)}$, namely

$$\mu_i = \sum_{j=1}^{p}\theta_j x_i^{(j)}, \quad i = 1,\ldots,n.$$

The index $i$ usually corresponds to an index of the experiment or to a time of observation. The number $p$ of variables may be large, but we assume here that $p$ is bounded by $n - 3$.



4.1.1. The case of ordered variables. In some favorable situations, the explanatory variables $x^{(1)},\ldots,x^{(p)}$ are naturally ordered. In this case, we will consider the models spanned by the $m$ first explanatory variables, with $m$ ranging from 0 to $p$. In this direction, we set $S_0 = \{0\}$ and $S_m = \mathrm{span}\{x^{(1)},\ldots,x^{(m)}\}$ for $m \in \{1,\ldots,p\}$, where $x^{(j)} = (x_1^{(j)},\ldots,x_n^{(j)})'$. This collection of models is indexed by $\mathcal{M} = \{0,\ldots,p\}$ and contains one model per dimension. Note that $S_*$ coincides here with $S_p$.

We may use in this case the priors

$$\pi_m = \frac{e^{\alpha} - 1}{e^{\alpha} - e^{-\alpha p}}\, e^{-\alpha m}, \quad m = 0,\ldots,p,$$

with $\alpha > 0$, set $L_m = m/2$ and take the value (14) for $\beta$ with $N_* = n - p$. Then, according to Theorem 1, the performance of our procedure is controlled by

$$\mathbb{E}\left(\|\mu - \hat\mu\|^2\right) \le (1+\varepsilon_n)\,\inf_{m \in \mathcal{M}}\left\{\|\mu - \Pi_{S_m}\mu\|^2 + \frac{\bar\sigma^2}{\beta}\,(\alpha + 1/2)\,m\right\} + 1.2\,\frac{\bar\sigma^2}{\beta}\log\left(\frac{e^{\alpha}}{e^{\alpha} - 1}\right) + \frac{\sigma^2}{2\log n},$$

with $\varepsilon_n = (2n\log n)^{-1}$ and $\bar\sigma^2 = \sigma^2 + \|\mu - \Pi_{S_p}\mu\|^2/(n - p)$. As mentioned at the end of Section 3.1, the previous bound can be formulated as an oracle inequality when imposing the condition $p \le \kappa n$, for some $\kappa < 1$.



4.1.2. The case of unordered variables. When there is no natural order on the explanatory variables, we have to consider a larger collection of models. Fix some $q \le p$ which represents the maximal number of explanatory variables we want to take into account. Then, we write $\mathcal{M}$ for all the subsets of $\{1,\ldots,p\}$ of size at most $q$ and $S_m = \mathrm{span}\{x^{(j)}, j \in m\}$ for any nonempty $m$. We also set $S_\emptyset = \{0\}$ and $S_* = \mathrm{span}\{x^{(1)},\ldots,x^{(p)}\}$. Note that the cardinality of $\mathcal{M}$ is of order $p^q$, so when $p$ is large the value $q$ should remain small in practice.

A possible choice for $\pi_m$ is

$$\pi_m = \left[\binom{p}{|m|}\,(|m| + 1)\,H_q\right]^{-1} \quad\text{with}\quad H_q = \sum_{d=0}^{q}\frac{1}{d+1} \le 1 + \log(q + 1).$$

Again, we choose the value (14) for $\beta$ with $N_* = n - p$ and $L_m = |m|/2$. With this choice, combining the inequality $\binom{p}{|m|} \le (ep/|m|)^{|m|}$ with Theorem 1 gives the following bound on the risk of the procedure

$$\mathbb{E}\left(\|\mu - \hat\mu\|^2\right) \le (1+\varepsilon_n)\,\inf_{m \in \mathcal{M}}\left\{\|\mu - \Pi_{S_m}\mu\|^2 + \frac{\bar\sigma^2}{\beta}\left[|m|\left(3/2 + \log\frac{p}{|m|}\right) + \log(|m| + 1)\right]\right\} + \frac{\sigma^2}{2\log n} + \frac{\bar\sigma^2}{\beta}\log\log[(q+1)e],$$

with $\varepsilon_n = (2n\log n)^{-1}$ and $\bar\sigma^2 = \sigma^2 + \|\mu - \Pi_{S_*}\mu\|^2/(n - p)$.

Remark. When the family $\{x^{(1)},\ldots,x^{(p)}\}$ is orthogonal and $q = p$, we fall into the setting of Section 2.2. An alternative in this case is to use $\hat\mu$ given by (8), which is easy to compute numerically.
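The prior of Section 4.1.2 gives the same mass $1/((d+1)H_q)$ to each model size $d$, shared equally among the $\binom{p}{d}$ models of that size. The short Python sketch below checks that these weights sum to one and that $H_q \le 1 + \log(q+1)$; the function names and the values $p = 10$, $q = 3$ are assumptions made for the example.

```python
import math

def harmonic_constant(q):
    """H_q = sum_{d=0}^{q} 1 / (d + 1)."""
    return sum(1.0 / (d + 1) for d in range(q + 1))

def prior_weight(size, p, q):
    """pi_m for any subset m of {1, ..., p} with |m| = size <= q."""
    return 1.0 / (math.comb(p, size) * (size + 1) * harmonic_constant(q))

p, q = 10, 3                               # example values (assumed)
# There are comb(p, d) models of size d, all with the same prior weight.
total_mass = sum(math.comb(p, d) * prior_weight(d, p, q) for d in range(q + 1))
h_q = harmonic_constant(q)
```

Smaller models receive more mass both because the size $d$ is penalized through $1/(d+1)$ and because fewer models share that mass.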



4.2. Estimation of BV functions. We consider here the functional setting

(19) $\mu_i = f(x_i), \quad i = 1,\ldots,n,$

where $f : [0,1] \to \mathbb{R}$ is an unknown function and $x_1,\ldots,x_n$ are $n$ deterministic points of $[0,1]$. We assume for simplicity that $0 = x_1 < x_2 < \cdots < x_n < x_{n+1} = 1$ and $n = 2^{J_n} \ge 8$. We set $J_* = J_n - 1$ and $\Lambda_* = \cup_{j=0}^{J_*}\Lambda(j)$ with $\Lambda(0) = \{(0,0)\}$ and $\Lambda(j) = \{j\}\times\{0,\ldots,2^{j-1} - 1\}$ for $j \ge 1$. For $(j,k) \in \Lambda_*$ we define $v_{j,k} \in \mathbb{R}^n$ by

$$[v_{j,k}]_i = 2^{(j-1)/2}\left(\mathbf{1}_{I^+_{j,k}}(i) - \mathbf{1}_{I^-_{j,k}}(i)\right), \quad i = 1,\ldots,n,$$

with $I^+_{j,k} = \{1 + (2k+1)2^{-j}n,\ldots,(2k+2)2^{-j}n\}$ and $I^-_{j,k} = \{1 + 2k\,2^{-j}n,\ldots,(2k+1)2^{-j}n\}$. The family $\{v_{j,k}, (j,k) \in \Lambda_*\}$ corresponds to the image of the points $x_1,\ldots,x_n$ by a Haar basis (see Section 6.5) and it is orthonormal for the scalar product

$$\langle x, y\rangle_n = \frac{1}{n}\sum_{i=1}^{n} x_i y_i.$$

We use the collection of models $S_m = \mathrm{span}\{v_{j,k}, (j,k) \in m\}$ indexed by $\mathcal{M} = \mathcal{P}(\Lambda_*)$ and fall into the setting of Section 2.2. We choose the distribution $\pi$ given by (7) with $p = n/2$ and $\alpha = 1$. We also set $b = 1$ and take some $\beta$ fulfilling

$$\beta \le \frac{1}{2}\,\phi^{-1}\left(\frac{2\log(n/2)}{n}\right).$$

According to Proposition 1, the estimator (6) then takes the form

(20) $\hat\mu = \sum_{j=0}^{J_*}\sum_{k \in \Lambda(j)}\left(\dfrac{\exp\left(n\beta Z_{j,k}^2/\hat\sigma^2\right)}{en/2 + \exp\left(n\beta Z_{j,k}^2/\hat\sigma^2\right)}\, Z_{j,k}\right) v_{j,k},$

with $Z_{j,k} = \langle Y, v_{j,k}\rangle_n$ and

$$\hat\sigma^2 = 2\left(\langle Y, Y\rangle_n - \sum_{j=0}^{J_*}\sum_{k \in \Lambda(j)} Z_{j,k}^2\right).$$

The next corollary gives the rate of convergence of this estimator when $f$ has bounded variation, in terms of the norm $\|\cdot\|_n$ induced by the scalar product $\langle\cdot,\cdot\rangle_n$.

Corollary 1. In the setting described above, there exists a numerical constant C such that for 
any function f with bounded variation V{f) 



la < c^Jfm^V" , nil , 

' I V / n I 



The proof is delayed to Section 6.5. The minimax rate in this setting is $(V(f)\sigma^2/n)^{2/3}$. So, the rate of convergence of the estimator differs from the minimax rate by a $(\log n)^{2/3}$ factor. We can actually obtain a rate-minimax estimator by using a smaller collection of models similar to the one introduced in the next section, but we then lose Formula (20).



4.3. Regression on Besov space $B^{\alpha}_{p,\infty}[0,1]$. We consider again the setting (19) with $f : [0,1] \to \mathbb{R}$ and introduce an $L^2([0,1], dx)$-orthonormal family $\{\phi_{j,k},\ j \ge 0,\ k = 1,\ldots,2^j\}$ of compactly supported wavelets with regularity $r$. We will use models generated by finite subsets of wavelets. If we want our estimator to share some good adaptive properties on Besov spaces, we shall introduce a family of models induced by the compression algorithm of Birgé and Massart [7]. This collection turns out to be slightly more intricate than the family used in the previous section. We start with some $\kappa < 1$ and set $J_* = \lfloor\log(\kappa n/2)/\log 2\rfloor$. The largest approximation space we will consider is

$$\mathcal{F}_* = \mathrm{span}\{\phi_{j,k},\ j = 0,\ldots,J_*,\ k = 1,\ldots,2^j\},$$

whose dimension is bounded by $\kappa n$. For $1 \le J \le J_*$, we define

$$\mathcal{M}_J = \left\{m = \bigcup_{j=0}^{J_*}\{j\}\times A_j,\ \text{with } A_j \in \Lambda_{j,J}\right\},$$

where $\Lambda_{j,J} = \{\{1,\ldots,2^j\}\}$ when $j \le J - 1$ and

$$\Lambda_{j,J} = \left\{A \subset \{1,\ldots,2^j\} : |A| = \lfloor 2^J/(j - J + 1)^3\rfloor\right\}, \quad\text{when } J \le j \le J_*.$$

To $m$ in $\mathcal{M} = \cup_{J=1}^{J_*}\mathcal{M}_J$, we associate $\mathcal{F}_m = \mathrm{span}\{\phi_{j,k}, (j,k) \in m\}$ and define the model $S_m$ by

$$S_m = \{(f(x_1),\ldots,f(x_n)),\ f \in \mathcal{F}_m\} \subset S_* = \{(f(x_1),\ldots,f(x_n)),\ f \in \mathcal{F}_*\}.$$

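When $m \in \mathcal{M}_J$, the dimension of $S_m$ is bounded from above by

(21) $\dim(S_m) \le \sum_{j=0}^{J-1} 2^j + \sum_{j=J}^{J_*}\dfrac{2^J}{(j - J + 1)^3} \le 2^J\left[1 + \sum_{k=1}^{J_* - J + 1} k^{-3}\right] \le 2.2\cdot 2^J$

and $\dim(S_*) \le \kappa n$. Note also that the cardinality of $\mathcal{M}_J$ is

$$|\mathcal{M}_J| = \prod_{j=J}^{J_*}\binom{2^j}{\lfloor 2^J/(j - J + 1)^3\rfloor}.$$

To estimate $\mu$, we use the estimator $\hat\mu$ given by (6) with $\beta$ given by (14) and

$$L_m = 1.1\cdot 2^J \quad\text{and}\quad \pi_m = \left[2^J\left(1 - 2^{-J_*}\right)\prod_{j=J}^{J_*}\binom{2^j}{\lfloor 2^J/(j - J + 1)^3\rfloor}\right]^{-1}, \quad\text{for } m \in \mathcal{M}_J.$$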



The next corollary gives the rate of convergence of the estimator $\hat\mu$ when $f$ belongs to some Besov ball $B^{\alpha}_{p,\infty}(R)$ with $1/p < \alpha < r$ (we refer to De Vore and Lorentz [12] for a precise definition of Besov spaces). As is usual in this setting, we express the result in terms of the norm $\|\cdot\|_n^2 = \|\cdot\|^2/n$ on $\mathbb{R}^n$.

Corollary 2. For any $p, R > 0$ and $\alpha \in\ ]1/p, r[$, there exists some constant $C$ not depending on $n$ and $\sigma^2$ such that the estimator $\hat\mu$ defined above fulfills

$$\mathbb{E}\left(\|\mu - \hat\mu\|_n^2\right) \le C\max\left\{\left(\frac{\sigma^2}{n}\right)^{2\alpha/(2\alpha+1)},\ \frac{1}{n^{2(\alpha - 1/p)}}\right\}$$

for any $\mu$ given by (19) with $f \in B^{\alpha}_{p,\infty}(R)$.

The proof is delayed to Section 6.6. We remind the reader that the rate $(\sigma^2/n)^{2\alpha/(2\alpha+1)}$ is minimax in this framework, see Yang and Barron [24].

Figure 2. Recovering a signal from noisy observations (the crosses). The signal is in black, the estimator is in red.



5. A numerical illustration

We illustrate the use of our procedure on a numerical simulation. We start from a signal

$$f(x) = 0.7\cos(x) + \cos(7x) + 1.5\sin(x) + 0.8\sin(5x) + 0.9\sin(8x), \quad x \in [0,1],$$

which is in black in Figure 2. We have $n = 60$ noisy observations of this signal

$$Y_i = f(x_i) + \sigma\varepsilon_i, \quad i = 1,\ldots,60 \quad\text{(the crosses in Figure 2)},$$

where $x_i = i/60$ and $\varepsilon_1,\ldots,\varepsilon_{60}$ are 60 i.i.d. standard Gaussian random variables. The noise level $\sigma$ is not known ($\sigma$ equals 1 in Figure 2). To estimate $f$ we will expand the observations on the Fourier basis

$$\{1, \cos(2\pi x),\ldots,\cos(40\pi x), \sin(2\pi x),\ldots,\sin(40\pi x)\}.$$

In this direction, we introduce the $p = 41$ vectors $v_1,\ldots,v_{41}$ given by

$$v_j = \begin{cases}\sqrt{2/n}\,\left(\sin(2\pi j x_1),\ldots,\sin(2\pi j x_n)\right)' & \text{when } j \in \{1,\ldots,20\},\\[2pt] \sqrt{1/n}\,(1,\ldots,1)' & \text{when } j = 21,\\[2pt] \sqrt{2/n}\,\left(\cos(2\pi(j-21)x_1),\ldots,\cos(2\pi(j-21)x_n)\right)' & \text{when } j \in \{22,\ldots,41\}.\end{cases}$$

These vectors $\{v_1,\ldots,v_{41}\}$ form an orthonormal family for the usual scalar product of $\mathbb{R}^n$ and we fall into the setting of Section 2.2.

We estimate $(f(x_1),\ldots,f(x_n))'$ with $\hat\mu$ given by (8) with the parameters $\alpha = 1$, $b = 1$ and $\beta = 1/3$. Finally, we estimate $f$ with

$$\hat f(x) = \hat a_0 + \sum_{j=1}^{20}\hat a_j\cos(2\pi j x) + \sum_{j=1}^{20}\hat b_j\sin(2\pi j x) \quad\text{(in red in Figure 2)},$$

where $\hat a_0 = \sqrt{1/n}\,\langle\hat\mu, v_{21}\rangle$ and $\hat a_j = \sqrt{2/n}\,\langle\hat\mu, v_{j+21}\rangle$, $\hat b_j = \sqrt{2/n}\,\langle\hat\mu, v_j\rangle$ for $j = 1,\ldots,20$. The plot of $\hat f$ is in red in Figure 2.



6. Proofs 



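6.1. Proof of Proposition 1. Let us first express the weights $w_m$ in terms of the $Z_j$'s and $r = \alpha\log p + b$. We use below the convention that a sum of zero terms is equal to zero. Then for any $m \in \mathcal{M}$, we have

$$w_m = \frac{\exp\left(\beta\|\hat\mu_m\|^2/\hat\sigma^2 - (\alpha\log p + b)|m|\right)}{\sum_{m' \in \mathcal{M}}\exp\left(\beta\|\hat\mu_{m'}\|^2/\hat\sigma^2 - (\alpha\log p + b)|m'|\right)} = \frac{\exp\left(\sum_{k \in m}\left(\beta Z_k^2/\hat\sigma^2 - r\right)\right)}{\sum_{m' \in \mathcal{M}}\exp\left(\sum_{k \in m'}\left(\beta Z_k^2/\hat\sigma^2 - r\right)\right)}.$$

We write $\mathcal{M}_j$ for the set of all the subsets of $\{1,\ldots,j-1,j+1,\ldots,p\}$. Then, for any $j \in \{1,\ldots,p\}$,

$$c_j = \sum_{m \in \mathcal{M}}\mathbf{1}_{j \in m}\, w_m = \frac{\sum_{m \in \mathcal{M}}\mathbf{1}_{j \in m}\exp\left(\sum_{k \in m}\left(\beta Z_k^2/\hat\sigma^2 - r\right)\right)}{\sum_{m \in \mathcal{M}}\exp\left(\sum_{k \in m}\left(\beta Z_k^2/\hat\sigma^2 - r\right)\right)}.$$

Note that any subset $m \in \mathcal{M}$ containing $j$ can be written as $\{j\}\cup m'$ with $m' \in \mathcal{M}_j$, so that

$$c_j = \frac{e^{\beta Z_j^2/\hat\sigma^2 - r}\sum_{m' \in \mathcal{M}_j}\exp\left(\sum_{k \in m'}\left(\beta Z_k^2/\hat\sigma^2 - r\right)\right)}{\left(1 + e^{\beta Z_j^2/\hat\sigma^2 - r}\right)\sum_{m \in \mathcal{M}_j}\exp\left(\sum_{k \in m}\left(\beta Z_k^2/\hat\sigma^2 - r\right)\right)} = \frac{e^{\beta Z_j^2/\hat\sigma^2}}{e^{r} + e^{\beta Z_j^2/\hat\sigma^2}}.$$

Formula (8) follows.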



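6.2. A preliminary lemma.

Lemma 1. Consider an integer $N$ larger than 2 and a random variable $X$ such that $NX$ is distributed as a $\chi^2$ of dimension $N$. Then, for any $0 < a < 1$,

(22) $\mathbb{E}\left[(a - X)_+\right] \le \mathbb{E}\left[\left(\dfrac{a}{X} - 1\right)_+\right] \le \dfrac{2}{(1-a)(N-2)}\exp\left(-N\phi(a)\right),$

with $\phi(a) = \frac{1}{2}(a - 1 - \log a) > \frac{1}{4}(1 - a)^2$.

Proof: Recall first that when $g : \mathbb{R}^+ \to \mathbb{R}^+$ and $f : \mathbb{R}^+ \to \mathbb{R}^+$ have opposite monotonicity,

$$\mathbb{E}[g(X)f(X)] \le \mathbb{E}[g(X)]\,\mathbb{E}[f(X)].$$

Setting $g(x) = x$ and $f(x) = (a/x - 1)_+$ leads to the first inequality since $\mathbb{E}(X) = 1$.

We now turn to the second inequality. To start with, we note that

$$\mathbb{E}\left[\left(\frac{a}{X} - 1\right)_+\right] = a\,\mathbb{E}\left[\left(\frac{1}{X} - \frac{1}{a}\right)_+\right] = a\int_{1/a}^{+\infty}\mathbb{P}\left(X \le \frac{1}{t}\right)dt.$$

Markov inequality gives for any $\lambda \ge 0$

$$\mathbb{P}\left(X \le \frac{1}{t}\right) \le e^{\lambda/t}\,\mathbb{E}\left(e^{-\lambda X}\right) = e^{\lambda/t}\left(1 + \frac{2\lambda}{N}\right)^{-N/2},$$

and choosing $\lambda = N(t-1)/2$ leads to

(23) $\mathbb{P}\left(X \le \dfrac{1}{t}\right) \le \exp\left(\dfrac{N}{2}\left(1 - \dfrac{1}{t}\right)\right)t^{-N/2} = \exp\left(-N\phi(1/t)\right)$

for any $t > 1$. Putting pieces together ensures the bound

$$\mathbb{E}\left[\left(\frac{a}{X} - 1\right)_+\right] \le a\int_{1/a}^{+\infty}\exp\left(\frac{N}{2}\left(1 - \frac{1}{t}\right)\right)t^{-N/2}\,dt = a\int_0^a\exp\left(\frac{N}{2}(1 - x)\right)x^{N/2 - 2}\,dx$$

for any $0 < a < 1$. Iterating integrations by parts leads to

$$\mathbb{E}\left[\left(\frac{a}{X} - 1\right)_+\right] \le \exp\left(\frac{N}{2}(1 - a)\right)\sum_{k \ge 0}\frac{a^{N/2 + k}\,(N/2)^k}{\prod_{i=0}^{k}(N/2 - 1 + i)} \le \exp\left(\frac{N}{2}(1 - a)\right)\frac{2a^{N/2}}{(N-2)(1-a)} = \frac{2}{(1-a)(N-2)}\exp\left(-N\phi(a)\right)$$

for $0 < a < 1$, and the bound (22) follows.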



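6.3. Proof of Theorem 1. To keep formulas short, we write $d_m$ for the dimension of $S_m$, and use the following notations for the various projections

$$\hat\mu_* = \Pi_{S_*}Y, \quad \mu_* = \Pi_{S_*}\mu \quad\text{and}\quad \mu_m = \Pi_{S_m}\mu, \quad m \in \mathcal{M}.$$

It is also convenient to write $w_m$ in the form

$$w_m = \frac{\pi_m}{Z'}\exp\left(-\beta\,\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\hat\sigma^2} - L_m\right) \quad\text{with}\quad Z' = e^{-\beta\|\hat\mu_*\|^2/\hat\sigma^2} Z.$$

By construction, the estimator $\hat\mu$ belongs to $S_*$, so according to Pythagorean equality we have $\|\mu - \hat\mu\|^2 = \|\mu - \mu_*\|^2 + \|\mu_* - \hat\mu\|^2$. The first part is non-random, and we only need to control the second part.

According to Theorem 1 in Leung and Barron [17], the Stein's unbiased estimate of the $L^2$-risk $\mathbb{E}\left(\|\mu_* - \hat\mu\|^2\right)/\sigma^2$ of the estimator $\hat\mu$ on $S_*$ can be written as

$$S(\hat\mu) = \sum_{m \in \mathcal{M}} w_m\left(\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2} + 2d_m - d_* - \frac{\|\hat\mu - \hat\mu_m\|^2}{\sigma^2} + 2\beta\,\nabla\left(\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\hat\sigma^2}\right)\cdot(\hat\mu - \hat\mu_m)\right).$$

When expanding the gradient into

$$\nabla\left(\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\hat\sigma^2}\right)\cdot(\hat\mu - \hat\mu_m) = \frac{2(\hat\mu_* - \hat\mu_m)}{\hat\sigma^2}\cdot(\hat\mu - \hat\mu_m) + \|\hat\mu_* - \hat\mu_m\|^2\,\nabla\left(1/\hat\sigma^2\right)\cdot(\hat\mu - \hat\mu_m),$$

the term $\|\hat\mu_* - \hat\mu_m\|^2\,\nabla(1/\hat\sigma^2)\cdot(\hat\mu - \hat\mu_m)$ turns out to be 0, since $\nabla(1/\hat\sigma^2)$ is orthogonal to $S_*$. Furthermore, the sum

$$\sum_{m \in \mathcal{M}} w_m(\hat\mu - \hat\mu_*)\cdot(\hat\mu - \hat\mu_m)$$

also equals 0, so an unbiased estimate of $\mathbb{E}\left(\|\mu_* - \hat\mu\|^2\right)/\sigma^2$ is

$$S(\hat\mu) = \sum_{m \in \mathcal{M}} w_m\left(\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2} + 2d_m - d_* + \left(\frac{4\beta\sigma^2}{\hat\sigma^2} - 1\right)\frac{\|\hat\mu_m - \hat\mu\|^2}{\sigma^2}\right).$$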



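We control the last term thanks to the upper bound

$$\sum_{m \in \mathcal{M}} w_m\|\hat\mu_m - \hat\mu\|^2 = \sum_{m \in \mathcal{M}} w_m\|\hat\mu_* - \hat\mu_m\|^2 - \|\hat\mu_* - \hat\mu\|^2 \le \sum_{m \in \mathcal{M}} w_m\|\hat\mu_* - \hat\mu_m\|^2$$

and get

$$S(\hat\mu) \le \sum_{m \in \mathcal{M}} w_m\left(\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2} + 2d_m - d_*\right) + \left(\frac{4\beta\sigma^2}{\hat\sigma^2} - 1\right)_+\sum_{m \in \mathcal{M}} w_m\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2}$$
$$= \left[1 + \left(\frac{4\beta\sigma^2}{\hat\sigma^2} - 1\right)_+\right]\sum_{m \in \mathcal{M}} w_m\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2} + 2\sum_{m \in \mathcal{M}} w_m d_m - d_*,$$

where $(x)_+ = \max(0, x)$. First note that when $L_m \ge d_m/2$ we have

$$2d_m - \left[1 + \left(\frac{4\beta\sigma^2}{\hat\sigma^2} - 1\right)_+\right]\frac{\hat\sigma^2}{\beta\sigma^2}\,L_m \le \left(2 - \max\left(\frac{\hat\sigma^2}{2\beta\sigma^2},\,2\right)\right)d_m = \min\left(0,\ 2 - \frac{\hat\sigma^2}{2\beta\sigma^2}\right)d_m \le 0.$$

Therefore, setting $\delta_\beta = \left(4\beta\sigma^2/\hat\sigma^2 - 1\right)_+$ we get

$$S(\hat\mu) \le (1 + \delta_\beta)\sum_{m \in \mathcal{M}} w_m\left[\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2} + \frac{\hat\sigma^2}{\beta\sigma^2}\,L_m\right] - d_*.$$

Let us introduce the Kullback divergence between two probability distributions $\{a_m, m \in \mathcal{M}\}$ and $\{\pi_m, m \in \mathcal{M}\}$ on $\mathcal{M}$

$$\mathcal{D}(a|\pi) = \sum_{m \in \mathcal{M}} a_m\log\frac{a_m}{\pi_m} \ge 0$$

and the function

$$\mathcal{E}_\beta(a) = \sum_{m \in \mathcal{M}} a_m\left[\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2} + \frac{\hat\sigma^2}{\beta\sigma^2}\,L_m\right] + \frac{\hat\sigma^2}{\beta\sigma^2}\,\mathcal{D}(a|\pi).$$

The latter function is convex on the simplex $\mathcal{S}_{\mathcal{M}}^+ = \left\{a \in [0,1]^{|\mathcal{M}|},\ \sum_{m \in \mathcal{M}} a_m = 1\right\}$ and can be interpreted as a free energy function. Therefore, it is minimal for the Gibbs measure $\{w_m, m \in \mathcal{M}\}$ and for any $a \in \mathcal{S}_{\mathcal{M}}^+$,

$$S(\hat\mu) \le (1 + \delta_\beta)\left[\sum_{m \in \mathcal{M}} a_m\left(\frac{\|\hat\mu_* - \hat\mu_m\|^2}{\sigma^2} + \frac{\hat\sigma^2}{\beta\sigma^2}\,L_m\right) + \frac{\hat\sigma^2}{\beta\sigma^2}\,\mathcal{D}(a|\pi)\right] - d_*.$$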



We fix a probability distribution a G S'_j{^ and take the expectation in the last inequality to get 
E[5(/l)]< 



1 + E(,5^)) 



I /^* I^Ti 



+ E 



r(l + '5/3) 



E "m^m + I'(a|7r) 



Since $\hat\sigma^2/\sigma^2$ is stochastically larger than a random variable $X$ with $\chi^2(N_*)/N_*$ distribution, Lemma 1 ensures that the two expectations



E 



are bounded by 



E 



and E 



E 



■exp(-iV,</.(4/?)) 



(l-4/3)(iV, -2) 

with $\phi(x)=(x-1-\log(x))/2$. Furthermore, the condition $N_*\ge 2+(\log n)/\phi(4\beta)$ enforces
2 



:i-4/3)(iV, -2; 



.exp(-iV,<A(4/3)) < ^ ^TT— 

(1 — 4pjn log n 2n log n 
Putting pieces together, we obtain



eHI^-AII 



< 



a2 



+ (1 + En) "rn^ 



+ 



a 



amLm + Vialir) 



< 



1/^ - fJ-*\ 
a2 



+ (1 ^ Or 

in£M 



This inequality holds for any non-random probability distribution $\alpha\in\mathcal S_{\mathcal M}^{+}$, so it holds in particular for the Gibbs measure



am = ^ exp 

2/3 



, m e M 



where $Z_\beta$ normalizes the sum of the $\alpha_m$'s to one. For this choice of $\alpha_m$ we obtain



< — log 



.meM 



which ensures (12) since $d_*\le n$. To get (13) simply note that

^ vr^exp {Wfi- HmW"^ - dmcr"^) - L 

m£M 

> exp 

for any m* G Ai. 



(II// - ^im*f - dm*0-'^) - Lrn* - log VT^ 



6.4. Proof of Proposition 2. We use along the proof the notations $\mu_*=\Pi_{S_*}\mu$, $[x]_+=\max(x,0)$ and $\gamma=\gamma_\beta(p)=\sqrt{2+\beta^{-1}\log p}$. We omit the proof of the bound $c_\beta(p)\le 16$ for $\beta\in[1/4,1/2]$ and $p\ge 3$. This proof (of minor interest) can be found in the Appendix.

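The risk of $\hat\mu$ is given by
$$\mathbb E(\|\mu-\hat\mu\|^2)=\|\mu-\mu_*\|^2+\sum_{j=1}^p\mathbb E\big((\langle\mu,v_j\rangle-s_\beta(Z_j/\sigma)Z_j)^2\big).$$
Note that $Z_j=\langle\mu,v_j\rangle+\sigma\langle\varepsilon,v_j\rangle$, with $\langle\varepsilon,v_j\rangle$ distributed as a standard Gaussian random variable. As a consequence, when $|\langle\mu,v_j\rangle|\le 4\gamma\sigma$ we have

(24) $\mathbb E\big((\langle\mu,v_j\rangle-s_\beta(Z_j/\sigma)Z_j)^2\big)\le c_\beta(p)\left[\min(\langle\mu,v_j\rangle^2,\gamma^2\sigma^2)+\gamma^2\sigma^2/p\right].$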

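If we prove the same inequality for $|\langle\mu,v_j\rangle|\ge 4\gamma\sigma$, then
$$\mathbb E(\|\mu-\hat\mu\|^2)\le\|\mu-\mu_*\|^2+c_\beta(p)\sum_{j=1}^p\left[\min(\langle\mu,v_j\rangle^2,\gamma^2\sigma^2)+\gamma^2\sigma^2/p\right]$$
$$\le\|\mu-\mu_*\|^2+c_\beta(p)\inf_{m\in\mathcal M}\left[\|\mu_*-\mu_m\|^2+\gamma^2(|m|+1)\sigma^2\right].$$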
This last inequality is exactly the Bound p8p . 

To conclude the proof of Proposition 2 we need to check that
$$\mathbb E\big((x-s_\beta(x+Z)(x+Z))^2\big)\le c_\beta(p)\gamma^2\quad\text{for }x\ge 4\gamma,$$
where $Z$ is distributed as a standard Gaussian random variable. We first note that
$$\mathbb E\big((x-s_\beta(x+Z)(x+Z))^2\big)\le 2x^2\,\mathbb E\big((1-s_\beta(x+Z))^2\big)+2\,\mathbb E\big(s_\beta(x+Z)^2Z^2\big)$$
$$\le 2x^2\,\mathbb E\big((1-s_\beta(x+Z))^2\big)+2.$$
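We can bound $(1-s_\beta(x+Z))^2$ as follows
$$(1-s_\beta(x+Z))^2\le\exp\left(-2\beta\left[(x+Z)^2-\gamma^2\right]_+\right)\le\exp\left(-\tfrac12\left[(x+Z)^2-\gamma^2\right]_+\right)$$
where the last inequality comes from $\beta\ge 1/4$. We then obtain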

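$$\mathbb E\big((x-s_\beta(x+Z)(x+Z))^2\big)\le 2+4x^2\,\mathbb E\left(\exp\left(-\tfrac12\left[(x-Z)^2-\gamma^2\right]_+\right)\mathbf 1_{Z>0}\right)$$
$$\le 2+4x^2\left[\mathbb P(0<Z<x/2)\exp\left(-\tfrac12\left[(x/2)^2-\gamma^2\right]_+\right)+\mathbb P(Z>x/2)\right]$$
$$\le 2+2x^2\exp\left(-x^2/8+\gamma^2/2\right)+4x^2\exp\left(-x^2/8\right).$$
When $p\ge 3$ and $\beta\le 1/2$, we have $\gamma\ge\sqrt{2+2\log 3}$ and then
$$\sup_{x\ge 4\gamma}x^2e^{-x^2/8}=16\gamma^2\exp(-2\gamma^2).$$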

Therefore,
$$\mathbb E\big((x-s_\beta(x+Z)(x+Z))^2\big)\le 2+16\gamma^2\left(2e^{-3\gamma^2/2}+4e^{-2\gamma^2}\right)\le 0.6\gamma^2\le c_\beta(p)\gamma^2,$$
where we used again the bound $\gamma^2\ge 2+2\log 3$. The proof of Proposition 2 is complete.
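The two numerical facts used at the end of this proof can be checked with a few lines of code. This is only a sanity-check sketch: it assumes that the shrinkage weight is $s_\beta(z)=e^{\beta z^2}/(pe^{2\beta}+e^{\beta z^2})$ (the form consistent with $\gamma^2=2+\beta^{-1}\log p$), and the function names are introduced here for illustration.

```python
import math

def s_beta(z, beta, p):
    """Assumed weight exp(beta z^2) / (p exp(2 beta) + exp(beta z^2)), in a stable form."""
    return 1.0 / (1.0 + p * math.exp(beta * (2.0 - z * z)))

def final_bound_over_gamma2(gamma2):
    """(2 + 16 g^2 (2 e^{-3 g^2 / 2} + 4 e^{-2 g^2})) / g^2, to be compared with 0.6."""
    return (2.0 + 16.0 * gamma2 * (2.0 * math.exp(-1.5 * gamma2)
                                   + 4.0 * math.exp(-2.0 * gamma2))) / gamma2

# Smallest possible gamma^2 when p >= 3 and beta <= 1/2.
gamma2_min = 2.0 + 2.0 * math.log(3.0)
ratios = [final_bound_over_gamma2(gamma2_min + 0.1 * i) for i in range(200)]

# Check (1 - s)^2 <= exp(-2 beta [z^2 - gamma^2]_+) on a grid, for p = 3 and beta = 1/2.
beta, p = 0.5, 3
gamma2 = 2.0 + math.log(p) / beta
pointwise_ok = all(
    (1.0 - s_beta(z, beta, p)) ** 2
    <= math.exp(-2.0 * beta * max(z * z - gamma2, 0.0)) + 1e-15
    for z in [0.05 * i for i in range(400)]
)
```

The list `ratios` is largest at `gamma2_min`, which is the case the proof has to cover.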
6.5. Proof of Corollary 1. We start by proving some results on the approximation of BV functions with the Haar wavelets.

6.5.1. Approximation of BV functions. We stick to the setting of Section 4.2 with $0=x_1<x_2<\dots<x_n<x_{n+1}=1$ and $n=2^{J_n}$. For $0\le j\le J_n$ and $p\in\Lambda(j)$ we define $t_{j,p}=x_{p2^{-j}n+1}$. We also set $\phi_{0,0}=1$ and

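$$\phi_{j,k}=2^{(j-1)/2}\left(\mathbf 1_{[t_{j,2k},\,t_{j,2k+1})}-\mathbf 1_{[t_{j,2k+1},\,t_{j,2k+2})}\right),\quad\text{for }1\le j\le J_n\text{ and }k\in\Lambda(j).$$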

This family of Haar wavelets is orthonormal for the positive semi-definite quadratic form
$$(f,g)_n=\frac1n\sum_{i=1}^nf(x_i)g(x_i)$$
on functions mapping $[0,1]$ into $\mathbb R$. For $0\le J\le J_n$, we write $f_J$ for the projection of $f$ onto the linear space spanned by $\{\phi_{j,k},\ 0\le j\le J,\ k\in\Lambda(j)\}$ with respect to $(\cdot,\cdot)_n$, namely
$$f_J=\sum_{j=0}^J\sum_{k\in\Lambda(j)}c_{j,k}\phi_{j,k},\quad\text{with }c_{j,k}=(f,\phi_{j,k})_n.$$

We also consider for $1\le J\le J_n$ an approximation of $f$ à la Birgé and Massart [6]
$$\tilde f_J=f_{J-1}+\sum_{j=J}^{J_n}\sum_{k\in\Lambda'_J(j)}c_{j,k}\phi_{j,k},$$
where $\Lambda'_J(j)\subset\Lambda(j)$ is the set of indices $k$ we obtain when we select the $K_{J,j}$ largest coefficients $|c_{j,k}|$ amongst $\{|c_{j,k}|,\ k\in\Lambda(j)\}$, with $K_{J,j}=\lfloor(j-J+1)^{-3}2^{J-2}\rfloor$ for $1\le J\le j\le J_n$. Note that the number of coefficients $c_{j,k}$ in $\tilde f_J$ is bounded from above by
$$1+\sum_{j=1}^{J-1}2^{j-1}+\sum_{j\ge J}(j-J+1)^{-3}2^{J-2}\le 2^{J-1}+2^{J-2}\sum_{p\ge 1}p^{-3}\le 2^J.$$
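The counting argument above can be verified by direct enumeration. The sketch below assumes that $\Lambda(j)$ has $2^{j-1}$ elements for $j\ge 1$ (which is what the sum $\sum_j 2^{j-1}$ suggests); the function name `kept_coefficients` is introduced here for illustration.

```python
def kept_coefficients(J, Jn):
    """Number of coefficients in the approximation at level J.

    Assumes |Lambda(j)| = 2**(j-1) for j >= 1 and a single coefficient at level 0.
    """
    # Levels 0 .. J-1 are kept entirely.
    coarse = 1 + sum(2 ** (j - 1) for j in range(1, J))
    fine = 0
    for j in range(J, Jn + 1):
        # K_{J,j} = floor((j-J+1)^{-3} 2^{J-2}), computed in exact integer arithmetic.
        fine += (2 ** J) // (4 * (j - J + 1) ** 3)
    return coarse + fine

Jn = 12
counts = {J: kept_coefficients(J, Jn) for J in range(1, Jn + 1)}
```

Working with integer division avoids any floating-point rounding in the floor.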

Next proposition states approximation bounds for $f_J$ and $\tilde f_J$ in terms of the (semi-)norm $\|f\|_n^2=(f,f)_n$.

Proposition 3. When $f$ has bounded variation $V(f)$, we have

(25) $\|f-f_J\|_n\le 2V(f)2^{-J/2}$, for $J\ge 0$

and

(26) $\|f-\tilde f_J\|_n\le cV(f)2^{-J}$, for $J\ge 1$

with $c=\sum_{p\ge 1}p^32^{-p/2+1}$.

Formulas (25) and (26) are based on the following fact.

Lemma 2. When $f$ has bounded variation $V(f)$, we have
$$\sum_{k\in\Lambda(j)}|c_{j,k}|\le 2^{-(j+1)/2}V(f),\quad\text{for }1\le j\le J_n.$$
Proof of the Lemma. We assume for simplicity that $f$ is non-decreasing. Then, we have
$$c_{j,k}=(f,\phi_{j,k})_n=\frac{2^{(j-1)/2}}{n}\left[\sum_{i\in I^+_{j,k}}f(x_i)-\sum_{i\in I^-_{j,k}}f(x_i)\right]$$
with $I^+_{j,k}$ and $I^-_{j,k}$ defined in Section 4.2. Since $|I^+_{j,k}|=2^{-j}n$ and $f$ is non-decreasing
$$|c_{j,k}|\le\frac{2^{(j-1)/2}}{n}\,|I^+_{j,k}|\left[f(x_{(2k+2)2^{-j}n})-f(x_{(2k)2^{-j}n})\right]$$
and Lemma 2 follows.



□ 
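Lemma 2 can be illustrated on a sampled non-decreasing function. The sketch below is a toy check, not the paper's code: it assumes that $c_{j,k}$ is $2^{(j-1)/2}/n$ times the difference between the sums of $f$ over the two consecutive half-blocks of $2^{-j}n$ design points, and it uses the variation over the sample points, which is at most $V(f)$.

```python
import math

def haar_coefficient_sums(values):
    """Return {j: sum_k |c_{j,k}|} for samples f(x_1), ..., f(x_n), n a power of two.

    Assumes c_{j,k} = (2^{(j-1)/2} / n) * (sum over first half-block
    minus sum over second half-block), each half-block holding n / 2^j points.
    """
    n = len(values)
    Jn = n.bit_length() - 1
    assert 2 ** Jn == n
    sums = {}
    for j in range(1, Jn + 1):
        half = n // 2 ** j                       # size of each half-block
        total = 0.0
        for k in range(2 ** (j - 1)):
            start = 2 * k * half
            plus = sum(values[start:start + half])
            minus = sum(values[start + half:start + 2 * half])
            total += abs(2 ** ((j - 1) / 2) * (plus - minus) / n)
        sums[j] = total
    return sums

n = 256
xs = [i / n for i in range(n)]
f = [math.floor(4 * x) + x ** 2 for x in xs]     # non-decreasing, with three jumps
variation = f[-1] - f[0]                         # variation over the sample points
sums = haar_coefficient_sums(f)
```

For each level `j`, the value `sums[j]` can be compared with `2 ** (-(j + 1) / 2) * variation`.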



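We first prove (25). Since the $\{\phi_{j,k},\ k\in\Lambda(j)\}$ have disjoint supports, we have for $0\le J\le J_n$
$$\|f-f_J\|_n\le\sum_{j>J}\Big\|\sum_{k\in\Lambda(j)}c_{j,k}\phi_{j,k}\Big\|_n=\sum_{j>J}\Big(\sum_{k\in\Lambda(j)}|c_{j,k}|^2\|\phi_{j,k}\|_n^2\Big)^{1/2}\le\sum_{j>J}\sum_{k\in\Lambda(j)}|c_{j,k}|.$$
Formula (25) then follows from Lemma 2.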

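To prove (26) we introduce the set $\bar\Lambda_J(j)=\Lambda(j)\setminus\Lambda'_J(j)$. Then, for $1\le J\le J_n$ we have
$$\|f-\tilde f_J\|_n\le\sum_{j=J}^{J_n}\Big(\sum_{k\in\bar\Lambda_J(j)}|c_{j,k}|^2\|\phi_{j,k}\|_n^2\Big)^{1/2}\le\sum_{j=J}^{J_n}\Big(\max_{k\in\bar\Lambda_J(j)}|c_{j,k}|\sum_{k\in\bar\Lambda_J(j)}|c_{j,k}|\Big)^{1/2}.$$
The choice of $\Lambda'_J(j)$ enforces the inequalities
$$(1+K_{J,j})\max_{k\in\bar\Lambda_J(j)}|c_{j,k}|\le\max_{k\in\bar\Lambda_J(j)}|c_{j,k}|+\sum_{k\in\Lambda'_J(j)}|c_{j,k}|\le\sum_{k\in\Lambda(j)}|c_{j,k}|.$$
To complete the proof of Proposition 3, we combine this bound with Lemma 2
$$\|f-\tilde f_J\|_n\le\sum_{j\ge J}2^{-(j+1)/2}V(f)(1+K_{J,j})^{-1/2}$$
$$\le\sum_{j\ge J}2^{-(j+1)/2}V(f)\,2^{-(J-2)/2}(j-J+1)^3$$
$$\le V(f)2^{-J}\sum_{p\ge 1}p^32^{-p/2+1}.$$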
6.5.2. Proof of Corollary 1. First, note that $v_{j,k}=(\phi_{j,k}(x_1),\dots,\phi_{j,k}(x_n))'$ for $(j,k)\in\Lambda_*$. Then, according to (25) and (26) there exists for any $0\le J\le J_*$ a model $m\in\mathcal M$ fulfilling $|m|\le 2^J$ and

11/^ - ns^/^lln = ll/^-n5,/^ll^ + l|n5.M-n5^/x||^ 



with c = X^p>i ^^2"^/^"^^. Putting together this approximation result with Theorem [T] gives 



^ ^ 0<J<J* 



n 



for some numerical constant C, when 



4"^ V«/2-2 

This bound still holds true when 

^M'l ( log'^ ^ < /? < i^-i /^log(n/2) 



n/2 - 2y - ^ - 2^ V n/2 

see Proposition [3] in the Appendix. To conclude the proof of Corollary [H we apply the previous 
bound with J given by the minimum between J* and the smallest integer such that 



, 1/3 

2^ > ' 



cr^ log n 



6.6. Proof of Corollary 2. First, according to the inequality $\binom{n}{k}\le(en/k)^k$ we have the bound for $m\in\mathcal M_J$

J* 



log TTm < log 2-^ + 5] {j-l+iy^ (e2^--^+i _ J + 1)3) 



< 2-^ 1^1 + ^ A;-3 (1 + 3 log A: + A; log 2) 
(27) < 4-2'^. 



k>l 



Second, when $f$ belongs to some Besov ball $\mathcal B^{\alpha}_{p,\infty}(R)$ with $1/p<\alpha<r$, Birgé and Massart [6] give the following approximation results. There exists a constant $C>0$, such that for any $J\le J_*$ and $f\in\mathcal B^{\alpha}_{p,\infty}(R)$, there exists $m\in\mathcal M_J$ fulfilling

11/ - n^™/l|oo < 11/ - n^./lloc + lin^J - n^^/IU < Cmax (^2-^.(-i/p),2-"^) 
where $\Pi_{\mathcal F}$ denotes the orthogonal projector onto $\mathcal F$ in $L^2([0,1])$. In particular, under the previous assumptions we have



n 

i=l 



(28) < ||/-n^„J||L<C2maxf2-2°^,2-2-^*(-i/^') 



To conclude the proof of Corollary 2, we combine Theorem 1 together with (21), (27) and (28) for



J = min I J* , 



log[max(n/o-^, 1)] ^ 
(2a + 1) log 2 
APPENDIX

The Appendix is devoted to the proof of the bound $c_\beta(p)\le 16$ and gives further explanations on Remark 4, Section 3.2. The results are stated in Appendix A and the proofs (of minor interest) can be found in Appendix B.



Appendix A. Two bounds 



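A.1. When the variance is known. In this section, we assume that the noise level $\sigma$ is known. We remind that $3\le p\le n$ and $\{v_1,\dots,v_p\}$ is an orthonormal family of vectors in $\mathbb R^n$. Next lemma gives an upper bound on the Euclidean risk of the estimator
$$\hat\mu=\sum_{j=1}^p\frac{\exp(\beta Z_j^2/\sigma^2)}{p\exp(\beta\lambda)+\exp(\beta Z_j^2/\sigma^2)}\,Z_jv_j,$$
with $\lambda\ge 2$ and $Z_j=\langle Y,v_j\rangle$, $j=1,\dots,p$.

Lemma 3. For any $\lambda\ge 2$ and $\beta>0$, we set $\gamma^2=\lambda+\beta^{-1}\log p$.

1. For $1/4\le\beta\le 1/2$ and $a\in\mathbb R$ we have the bound

(29) $\mathbb E\left[\left(a-\dfrac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 16\left[\min(a^2,\gamma^2)+\gamma^2/p\right],$

where $\varepsilon$ is distributed as a standard Gaussian random variable.
2. As a consequence, for any $0<\beta\le 1/2$ and $\lambda\ge 2$ the risk of the estimator $\hat\mu$ is upper bounded by

(30) $\mathbb E\|\mu-\hat\mu\|^2\le 16\inf_{m\in\mathcal M}\left[\|\mu-\mu_m\|^2+\gamma^2|m|\sigma^2+\gamma^2\sigma^2\right].$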



Note that (29) enforces the bound $c_\beta(p)\le 16$. This constant 16 is certainly far from optimal.
Indeed, when (3 = 1/2, A = 2 and 2> < p < 10^, Figure 1 in Section [3.21 shows that the bounds 
(1291) and ([30]) hold with 16 replaced by 1. 
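The ratio between the two sides of (29) can be explored numerically. The following sketch evaluates the left-hand side by a plain trapezoidal rule and divides it by $\min(a^2,\gamma^2)+\gamma^2/p$; the grid of values of $a$ and $p$, the integration range and the function names are choices made here for illustration, not taken from the paper.

```python
import math

def shrink_risk(a, beta, lam, p, half_width=12.0, steps=2400):
    """E[(a - s(a+eps)(a+eps))^2] for eps ~ N(0,1), by the trapezoidal rule."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        e = -half_width + i * h
        z = a + e
        # Weight exp(beta z^2) / (p exp(beta lam) + exp(beta z^2)), in a stable form.
        s = 1.0 / (1.0 + p * math.exp(beta * (lam - z * z)))
        val = (a - s * z) ** 2 * math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)
        total += val * (0.5 if i in (0, steps) else 1.0)
    return total * h

def ratio(a, beta, lam, p):
    """Left-hand side of (29) divided by min(a^2, gamma^2) + gamma^2 / p."""
    gamma2 = lam + math.log(p) / beta
    return shrink_risk(a, beta, lam, p) / (min(a * a, gamma2) + gamma2 / p)

grid = [0.25 * i for i in range(0, 61)]          # a in [0, 15]
worst = max(ratio(a, 0.5, 2.0, p) for p in (3, 10, 100) for a in grid)
```

The value `worst` is the largest ratio found on this grid for $\beta=1/2$ and $\lambda=2$; Lemma 3 guarantees that it cannot exceed 16.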



A.2. When the variance is unknown. We consider the same framework, except that the noise level $\sigma$ is not known. Next proposition provides a risk bound for the estimator

(31) $\hat\mu=\sum_{j=1}^p(c_jZ_j)v_j$, with $Z_j=\langle Y,v_j\rangle$ and $c_j=\dfrac{\exp(\beta Z_j^2/\hat\sigma^2)}{p\exp(b)+\exp(\beta Z_j^2/\hat\sigma^2)}.$

We remind that $\phi(x)=(x-1-\log x)/2$.
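As an illustration of (31), here is a minimal sketch of the shrinkage estimator in a toy setting. It assumes that $v_j$ is the $j$-th canonical basis vector of $\mathbb R^n$ and that $\hat\sigma^2=\|Y-\Pi_{S_*}Y\|^2/(n-p)$, so that the variance estimate uses the last $n-p$ coordinates; the function name and the simulated data are hypothetical.

```python
import math
import random

def shrinkage_estimator(y, p, beta=0.25, b=1.0):
    """Sketch of (31) when v_j is the j-th canonical basis vector, j = 1..p (an assumption).

    The variance estimate sigma_hat^2 = ||Y - Pi_{S_*} Y||^2 / (n - p) is computed
    from the last n - p coordinates of y.
    """
    n = len(y)
    sigma2_hat = sum(v * v for v in y[p:]) / (n - p)
    mu_hat = []
    for j in range(p):
        z = y[j]
        # c_j = exp(beta z^2 / s2) / (p exp(b) + exp(beta z^2 / s2)), written stably.
        c = 1.0 / (1.0 + p * math.exp(b - beta * z * z / sigma2_hat))
        mu_hat.append(c * z)
    return mu_hat + [0.0] * (n - p), sigma2_hat

random.seed(1)
n, p = 200, 20
mu = [5.0] * 3 + [0.0] * (n - 3)                 # three large coefficients, the rest null
y = [m + random.gauss(0.0, 1.0) for m in mu]
mu_hat, sigma2_hat = shrinkage_estimator(y, p)
```

Each coordinate of `mu_hat` is the corresponding observation multiplied by a weight in $(0,1)$, so large observations are almost kept and small ones are shrunk towards zero.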
Proposition 4. Assume that $\beta$ and $p$ fulfill the conditions

(32) $p\ge 3$, $0<\beta<1/2$ and $p+\dfrac{\log p}{\phi(2\beta)}\le n.$

Assume also that $b$ is not smaller than 1. Then, we have the following upper bound on the $L^2$-risk of the estimator $\hat\mu$ defined by (31)

(33) $\mathbb E(\|\mu-\hat\mu\|^2)\le 16\inf_{m\in\mathcal M}\left[\|\mu-\mu_m\|^2+\dfrac1\beta(b+\log p)(|m|+1)\bar\sigma^2+(2+b+\log p)\sigma^2\right],$

where $\bar\sigma^2=\sigma^2+\|\mu-\Pi_{S_*}\mu\|^2/(n-p)$.



We postpone the proof of Proposition 4 to Section B.2.



Appendix B. Proofs 



B.1. Proof of Lemma 3. When $\beta\le 1/4$ the bound (30) follows from a slight variation of Theorem 5 in [17]. When $1/4<\beta\le 1/2$, it follows from (29) (after a rescaling by $\sigma^2$ and a summation). So all we need is to prove (29). For symmetry reasons, we restrict to the case $a>0$.



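To prove Inequality (29), we first note that
$$\mathbb E\left[\left(a-\frac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]=\mathbb E\left[\left(\frac{a}{1+p^{-1}\exp(\beta[(a+\varepsilon)^2-\lambda])}-\frac{\varepsilon}{1+p\exp(\beta[\lambda-(a+\varepsilon)^2])}\right)^2\right]$$

(34) $\le 2a^2\,\mathbb E\left[\left(\dfrac{1}{1+p^{-1}\exp(\beta[(a+\varepsilon)^2-\lambda])}\right)^2\right]+2\,\mathbb E\left[\left(\dfrac{\varepsilon}{1+p\exp(\beta[\lambda-(a+\varepsilon)^2])}\right)^2\right]$

and then investigate apart the four cases $0\le a\le 2\gamma^{-1}$, $2\gamma^{-1}\le a\le 1$, $1\le a\le\sqrt3\,\gamma$ and $a\ge\sqrt3\,\gamma$. Inequality (29) will follow from Inequalities (36), (37), (38) and (39).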



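Case $0\le a\le 2\gamma^{-1}$. From (34) we get
$$\mathbb E\left[\left(a-\frac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 2a^2+2\,\mathbb E\left[\left(\frac{\varepsilon}{1+p\exp(\beta[\lambda-(a+\varepsilon)^2])}\right)^2\right]$$
$$\le 2a^2+4\,\mathbb E\left[\left(\frac{\varepsilon\mathbf 1_{\varepsilon>0}}{1+p\exp(\beta[\lambda-(a+\varepsilon)^2])}\right)^2\right]$$
$$\le 2a^2+4\,\mathbb E\left[\varepsilon^2\mathbf 1_{\varepsilon>0}\exp\left(-2\beta\left[\gamma^2-(a+\varepsilon)^2\right]_+\right)\right]$$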
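with $\gamma^2=\lambda+\beta^{-1}\log p$. Expanding the expectation gives
$$\mathbb E\left[\varepsilon^2\mathbf 1_{\varepsilon>0}\exp\left(-2\beta\left[\gamma^2-(a+\varepsilon)^2\right]_+\right)\right]=e^{-2\beta\gamma^2}\int_0^{\gamma-a}x^2e^{2\beta(a+x)^2-x^2/2}\frac{dx}{\sqrt{2\pi}}+\int_{\gamma-a}^{\infty}x^2e^{-x^2/2}\frac{dx}{\sqrt{2\pi}}.$$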



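Note that when $p\ge 3$, $\lambda\ge 2$ and $\beta\le 1/2$, we have $\gamma\ge 2$. Therefore, when $0\le a\le 1$, an integration by parts in the second integral gives
$$\int_{\gamma-a}^{\infty}x^2e^{-x^2/2}\frac{dx}{\sqrt{2\pi}}=\left[\frac{-xe^{-x^2/2}}{\sqrt{2\pi}}\right]_{\gamma-a}^{\infty}+\int_{\gamma-a}^{\infty}e^{-x^2/2}\frac{dx}{\sqrt{2\pi}}\le\frac{2(\gamma-a)}{\sqrt{2\pi}}e^{-(\gamma-a)^2/2}.$$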



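For the first integral, since $\beta\ge 1/4$, we have the bound
$$e^{-2\beta\gamma^2}\int_0^{\gamma-a}x^2e^{2\beta(a+x)^2-x^2/2}\frac{dx}{\sqrt{2\pi}}\le e^{-(\gamma-a)^2/2}\int_0^{\gamma-a}x^2\frac{dx}{\sqrt{2\pi}}\le\frac{(\gamma-a)^3}{3\sqrt{2\pi}}e^{-(\gamma-a)^2/2}.$$
Besides, an integration by parts also gives
$$e^{-2\beta\gamma^2}\int_0^{\gamma-a}x^2e^{2\beta(a+x)^2-x^2/2}\frac{dx}{\sqrt{2\pi}}\le(4\beta-1)^{-1}e^{-2\beta\gamma^2}\int_0^{\gamma-a}(4\beta(a+x)-x)\,x\,e^{2\beta(a+x)^2-x^2/2}\frac{dx}{\sqrt{2\pi}}$$
$$\le\frac{e^{-2\beta\gamma^2}}{(4\beta-1)\sqrt{2\pi}}\left[xe^{2\beta(a+x)^2-x^2/2}\right]_0^{\gamma-a}=\frac{\gamma-a}{(4\beta-1)\sqrt{2\pi}}e^{-(\gamma-a)^2/2}.$$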



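Putting pieces together, we obtain for $0\le a\le 1$ and $1/4\le\beta\le 1/2$
$$\mathbb E\left[\left(a-\frac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 2a^2+\frac{8+4\min\left((4\beta-1)^{-1},3^{-1}(\gamma-a)^2\right)}{\sqrt{2\pi}}(\gamma-a)e^{-(\gamma-a)^2/2}$$
$$\le 2a^2+\frac{8+4\min\left((4\beta-1)^{-1},3^{-1}(\gamma-a)^2e^{-(1/2-\beta)(\gamma-a)^2}\right)}{\sqrt{2\pi}}(\gamma-a)e^{-\beta(\gamma-a)^2}$$
$$\le 2a^2+\frac{8+4\min\left((4\beta-1)^{-1},[3e(1/2-\beta)]^{-1}\right)}{\sqrt{2\pi}}\,\gamma e^{-\beta(\gamma-a)^2}$$

(35) $\le 2a^2+5.6\,\gamma e^{-\beta(\gamma-a)^2}.$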
Furthermore, when $0\le a\le 2\gamma^{-1}$, we have $(\gamma-a)^2\ge\gamma^2-4$. The inequalities (35) and $\lambda\ge 2$ thus give
$$\mathbb E\left[\left(a-\frac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 2a^2+5.6\,e^{(4-\lambda)\beta}\gamma/p$$

(36) $\le 2a^2+8\gamma^2/p.$



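Case $2\gamma^{-1}\le a\le 1$. Starting from (35) we have for $2\gamma^{-1}\le a\le 1$
$$\mathbb E\left[\left(a-\frac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 2a^2+\frac{5.6\,\gamma^3e^{-\beta(\gamma-a)^2}}{\gamma^2}\le 2a^2+\frac{5.6\,\gamma^3e^{-(\gamma-1)^2/4}}{\gamma^2}$$

(37) $\le 2a^2+56\gamma^{-2}\le 16a^2.$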



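Case $1\le a\le\sqrt3\,\gamma$. From (34) we get
$$\mathbb E\left[\left(a-\frac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 2(a^2+1)$$

(38) $\le 4a^2\le 12\min(a^2,\gamma^2).$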



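Case $a>\sqrt3\,\gamma$. From (34) we get
$$\mathbb E\left[\left(a-\frac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 4a^2\,\mathbb E\left[\left(\frac{1}{1+p^{-1}\exp(\beta[(a-\varepsilon)^2-\lambda])}\right)^2\mathbf 1_{\varepsilon>0}\right]+2$$
$$\le 4a^2\,\mathbb E\left[\exp\left(-2\beta\left[(a-\varepsilon)^2-\gamma^2\right]_+\right)\mathbf 1_{\varepsilon>0}\right]+2$$
$$\le 2a^2\exp\left(-2\beta\left[(2a/3)^2-\gamma^2\right]_+\right)+4a^2\,\mathbb P(\varepsilon>a/3)+2.$$
Since $(2a/3)^2\ge a^2/9+\gamma^2$, we finally obtain

(39) $\mathbb E\left[\left(a-\dfrac{\exp(\beta(a+\varepsilon)^2)}{p\exp(\beta\lambda)+\exp(\beta(a+\varepsilon)^2)}(a+\varepsilon)\right)^2\right]\le 2a^2\exp\left(-2\beta a^2/9\right)+4a^2\exp\left(-a^2/18\right)+2\le 42\le 11\gamma^2.$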






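B.2. Proof of Proposition 4. We can express the weights $c_j$ appearing in (31) in the following way
$$c_j=\frac{e^{\hat\beta Z_j^2/\sigma^2}}{pe^{\hat\beta\lambda}+e^{\hat\beta Z_j^2/\sigma^2}},$$
with $\hat\beta=\beta\sigma^2/\hat\sigma^2$ and $\lambda=b/\hat\beta$. Note that $\hat\beta\le 1/2$ enforces $\lambda\ge 2$ since $b\ge 1$.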



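Since $\hat\sigma$ is independent of the $Z_j$'s, we can work conditionally on $\hat\sigma$. When $\hat\beta$ is not larger than 1/2 we apply Lemma 3 and get
$$\mathbb E\left(\|\mu-\hat\mu\|^2\mid\hat\sigma^2\right)\mathbf 1_{\{\hat\beta\le 1/2\}}\le 16\inf_{m\in\mathcal M}\left[\|\mu-\mu_m\|^2+\left(\lambda+\hat\beta^{-1}\log p\right)(|m|+1)\sigma^2\right]\mathbf 1_{\{\hat\beta\le 1/2\}}$$

(40) $\le 16\inf_{m\in\mathcal M}\left[\|\mu-\mu_m\|^2+\dfrac1\beta(b+\log p)(|m|+1)\hat\sigma^2\right]\mathbf 1_{\{\hat\beta\le 1/2\}}.$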



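When $\hat\beta$ is larger than 1/2, we use the following bound.

Lemma 4. Write $Z$ for a Gaussian random variable with mean $a$ and variance $\sigma^2$. Then for any $\lambda>0$ and $\hat\beta>1/2$

(41) $\mathbb E\left[\left(a-\dfrac{e^{\hat\beta Z^2/\sigma^2}}{pe^{\hat\beta\lambda}+e^{\hat\beta Z^2/\sigma^2}}\,Z\right)^2\right]\le 6\left(\lambda+\hat\beta^{-1}\log p\right)\sigma^2+36\sigma^2.$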



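Proof. First, we write $\varepsilon$ for a standard Gaussian random variable and obtain
$$\mathbb E\left[\left(a-\frac{e^{\hat\beta Z^2/\sigma^2}}{pe^{\hat\beta\lambda}+e^{\hat\beta Z^2/\sigma^2}}\,Z\right)^2\right]=\mathbb E\left[\left(\frac{a}{1+p^{-1}\exp\left(\hat\beta\left[(a/\sigma+\varepsilon)^2-\lambda\right]\right)}-\frac{\sigma\varepsilon}{1+p\exp\left(\hat\beta\left[\lambda-(a/\sigma+\varepsilon)^2\right]\right)}\right)^2\right]\le 2(a^2+\sigma^2).$$
Whenever $a^2$ is smaller than $3(\lambda+\hat\beta^{-1}\log p)\sigma^2$, this quantity remains smaller than $6(\lambda+\hat\beta^{-1}\log p)\sigma^2+2\sigma^2$.

When $a^2$ is larger than $3(\lambda+\hat\beta^{-1}\log p)\sigma^2$ we follow the same lines as in the last case of Section B.1 and get
$$\mathbb E\left[\left(a-\frac{e^{\hat\beta Z^2/\sigma^2}}{pe^{\hat\beta\lambda}+e^{\hat\beta Z^2/\sigma^2}}\,Z\right)^2\right]\le 2\sigma^2+4a^2\,\mathbb P\left(\varepsilon>|a|/(3\sigma)\right)+2a^2\exp\left(-2\hat\beta\left[\left(\frac{2a}{3\sigma}\right)^2-\left(\lambda+\hat\beta^{-1}\log p\right)\right]_+\right).$$
Since $(2a/3)^2\ge a^2/9+(\lambda+\hat\beta^{-1}\log p)\sigma^2$, we obtain
$$\mathbb E\left[\left(a-\frac{e^{\hat\beta Z^2/\sigma^2}}{pe^{\hat\beta\lambda}+e^{\hat\beta Z^2/\sigma^2}}\,Z\right)^2\right]\le 2\sigma^2+4a^2\exp\left(-a^2/(18\sigma^2)\right)+2a^2\exp\left(-a^2/(9\sigma^2)\right)\le 36\sigma^2.$$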



□ 



From (41), we obtain after summation



(42) E (II/. -/if 1^2) l|^>i/2| < 



(b + logp) (7^ + 36^2 



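Furthermore $\hat\sigma^2$ is smaller than $2\beta\sigma^2$ when $\hat\beta$ is larger than 1/2, so taking the expectation of (42) and (40) gives
$$\mathbb E(\|\mu-\hat\mu\|^2)\le 16\inf_{m\in\mathcal M}\left[\|\mu-\mu_m\|^2+\frac1\beta(b+\log p)(|m|+1)\bar\sigma^2\right]+p\left[12(b+\log p)+36\right]\mathbb P(\hat\sigma^2\le 2\beta\sigma^2)\,\sigma^2.$$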

The random variable $\hat\sigma^2/\sigma^2$ is stochastically larger than a random variable $X$ distributed as a $\chi^2$ of dimension $n-p$ divided by $n-p$. Lemma 1 gives, with $N=n-p$,
$$\mathbb P(X\le 2\beta)\le(2\beta)^{N/2}\exp\left(\frac N2(1-2\beta)\right)\le\exp\left(-N\phi(2\beta)\right)$$
and then $\mathbb P(\hat\sigma^2\le 2\beta\sigma^2)\le\exp(-(n-p)\phi(2\beta))$, so Condition (32) ensures that
$$p\,\mathbb P(\hat\sigma^2\le 2\beta\sigma^2)\le 1.$$
Finally, since $12(b+\log p)+36$ is smaller than $16(2+b+\log p)$ we get (33).
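The lower-tail bound $\mathbb P(X\le t)\le\exp(-N\phi(t))$ for $X$ distributed as $\chi^2(N)/N$ and $0<t<1$ can be compared with the exact probability. The sketch below restricts to even $N$, where the chi-square distribution function has a closed form as a finite Poisson-type sum; this restriction and the function names are choices made here for illustration.

```python
import math

def phi(x):
    """phi(x) = (x - 1 - log x) / 2."""
    return (x - 1.0 - math.log(x)) / 2.0

def chi2_cdf_even(x, dof):
    """P(chi^2_dof <= x) for even dof: 1 - exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!."""
    assert dof % 2 == 0
    half = x / 2.0
    term, acc = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        acc += term
    return 1.0 - math.exp(-half) * acc

def lower_tail_and_bound(N, t):
    """Return (P(X <= t), exp(-N phi(t))) for X ~ chi^2(N) / N and 0 < t < 1."""
    return chi2_cdf_even(N * t, N), math.exp(-N * phi(t))

checks = [lower_tail_and_bound(N, t) for N in (2, 10, 50) for t in (0.2, 0.5, 0.8)]
```

Each pair in `checks` holds the exact probability followed by the exponential bound; with $t=2\beta$ and $N=n-p$ this is the inequality used above.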